deps: Update dependency pytorch to >=2.7.0 #299

Open
wants to merge 1 commit into main

Conversation

renovate[bot] (Contributor) commented on May 17, 2025

This PR contains the following updates:

Package             Type           Update   Change
pytorch (source)    dependencies   minor    >=2.6.0 -> >=2.7.0

Release Notes

pytorch/pytorch (pytorch)

v2.7.0: PyTorch 2.7.0 Release

PyTorch 2.7.0 Release Notes
Highlights
Beta
  • Torch.Compile support for Torch Function Modes
  • Mega Cache

Prototype
  • NVIDIA Blackwell Architecture Support
  • PyTorch Native Context Parallel
  • Enhancing Intel GPU Acceleration
  • FlexAttention LLM first token processing on X86 CPUs
  • FlexAttention LLM throughput mode optimization on X86 CPUs
  • Foreach Map
  • Flex Attention for Inference
  • Prologue Fusion Support in Inductor

For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.

Tracked Regressions
NCCL init hits CUDA failure 'invalid argument' on 12.2 driver

Some users with the CUDA 12.2 driver (version 535) report seeing "CUDA driver error: invalid argument" during NCCL or Symmetric Memory initialization. This issue is currently under investigation; see #150852. If you build PyTorch from source, a known workaround is to rebuild PyTorch with the CUDA 12.2 toolkit. Otherwise, you can try upgrading the CUDA driver on your system.

Backwards Incompatible Changes
Dropped support for Triton < 2.2.0. Removed support for CUDA 12.4 and Anaconda in CI/CD.
C++ Extensions py_limited_api=True is now built with -DPy_LIMITED_API (#​145764)

We formally began respecting the py_limited_api=True kwarg in 2.6 and stopped linking libtorch_python.so when the flag was specified, as libtorch_python.so does not guarantee using APIs from the stable Python limited API. In 2.7, we go further by specifying the -DPy_LIMITED_API flag, which enforces that the extension is buildable with the limited API. As a result of this enforcement, custom extensions that set py_limited_api=True but do not abide by the limited API may fail to build. For an example, see #152243.

This is strictly better behavior, as it is sketchy to claim CPython agnosticism without enforcing it with the flag. If you run into this issue, please ensure that the extension you are building does not use any APIs outside of the Python limited API, e.g., pybind.

Change torch.Tensor.new_tensor() to be on the given Tensor's device by default (#​144958)

This function previously always created the new Tensor on the "cpu" device; it now uses the same device as the current Tensor object. This behavior is now consistent with the other .new_* methods.
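
A minimal sketch of the behavior change, assuming a CUDA device is available:

import torch

x = torch.ones(2, device="cuda")
y = x.new_tensor([1.0, 2.0])
# 2.6: y.device was always cpu; 2.7: y.device follows x (here cuda:0),
# consistent with the other .new_* methods.
print(y.device)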

Use Manylinux 2.28 and CXX11_ABI=1 for future released Linux wheel builds.

With the migration to manylinux_2_28 (AlmaLinux 8 based), we can no longer support OS distros with glibc 2.26. These include the popular Amazon Linux 2 and CentOS 7. (#143423, #146200, #148028, #148135, #148195, #148129)

torch.onnx.dynamo_export now uses the ExportedProgram logic path (#​137296)

Users using the torch.onnx.dynamo_export API may see some ExportOptions become
unsupported due to an internal switch to use torch.onnx.export(..., dynamo=True): diagnostic_options, fake_context and onnx_registry are removed/ignored by ExportOptions. Only dynamic_shapes is retained.

Users should move to use the dynamo=True option on torch.onnx.export as
torch.onnx.dynamo_export is now deprecated. Leverage the dynamic_shapes argument in torch.onnx.export for specifying dynamic shapes on the model.

Version 2.6.0

torch.onnx.dynamo_export(model, *args, **kwargs)

Version 2.7.0

torch.onnx.export(model, args, kwargs=kwargs, dynamo=True)
Finish deprecation of LRScheduler.print_lr() along with the verbose kwarg to the LRScheduler constructor. (#​147301)

Both APIs have been deprecated since 2.2. Please use LRScheduler.get_last_lr() to access the learning rate instead. print_lr and verbose were confusing, not properly documented, and little used, as described in #99270, so we deprecated them in 2.2. Now, we complete the deprecation by removing them entirely. To access and print the learning rate of an LRScheduler:

Version 2.6.0

optim = ...
lrsched = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, verbose=True)
# lrsched will internally call print_lr() and print the learning rate

Version 2.7.0

optim = ...
lrsched = torch.optim.lr_scheduler.ReduceLROnPlateau(optim)
print(lrsched.get_last_lr())
libtorch_python.so symbols are now invisible by default on all platforms except Apple (#​142214)

Previously, the symbols in libtorch_python.so were exposed with default visibility. We have transitioned to being more intentional about what we expose as public symbols for our Python API in C++. After #142214, public symbols are marked explicitly while everything else is hidden. Some extensions using private symbols will see linker failures with this change.

Please use torch.export.export instead of capture_pre_autograd_graph to export the model for pytorch 2 export quantization (#​139505)

capture_pre_autograd_graph was a temporary API in torch.export. Now that the better, longer-term API torch.export.export is available, capture_pre_autograd_graph is deprecated.

Version 2.6.0

from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config()
)
m = capture_pre_autograd_graph(m, *example_inputs)
m = prepare_pt2e(m, quantizer)

Version 2.7.0

from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e

# please get the xnnpack quantizer from executorch (https://github.com/pytorch/executorch/)
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config()
)
m = export(m, *example_inputs)
m = prepare_pt2e(m, quantizer)
New interface for torch.fx.passes.graph_transform_observer.GraphTransformObserver to enable Node Level provenance tracking (#​144277)

We now track a mapping between the nodes in the pre-grad and post-grad graphs. See the issue for an example frontend to visualize the transformations. To update your GraphTransformObserver subclasses, instead of overriding on_node_creation and on_node_erase, there are new functions get_node_creation_hook, get_node_erase_hook, get_node_replace_hook and get_deepcopy_hook. These are registered on the GraphModule member of the GraphTransformObserver upon entry and exit of a with block.

Version 2.6.0

class MyPrintObserver(GraphTransformObserver):
    def on_node_creation(self, node: torch.fx.Node):
        print(node)

Version 2.7.0

class MyPrintObserver(GraphTransformObserver):
    def get_node_creation_hook(self):
        def hook(node: torch.fx.Node):
            print(node)
        return hook
torch.ao.quantization.pt2e.graph_utils.get_control_flow_submodules is no longer public (#​141612)

We are planning to make all functions under torch.ao.quantization.pt2e.graph_utils private. This update marks get_control_flow_submodules as a private API. If you need to keep using get_control_flow_submodules, call the private _get_control_flow_submodules instead.

Example:
Version 2.6:

>>> from torch.ao.quantization.pt2e.graph_utils import get_control_flow_submodules

Version 2.7:

>>> from torch.ao.quantization.pt2e.graph_utils import get_control_flow_submodules
ImportError: cannot import name 'get_control_flow_submodules' from 'torch.ao.quantization.pt2e.graph_utils'
>>> from torch.ao.quantization.pt2e.graph_utils import _get_control_flow_submodules  # Note: Use _get_control_flow_submodules for private access
Deprecations
torch.onnx.dynamo_export is deprecated (#​146425, #​146639, #​146923)

Users should use the dynamo=True option on torch.onnx.export.

Version 2.6.0

torch.onnx.dynamo_export(model, *args, **kwargs)

Version 2.7.0

torch.onnx.export(model, args, kwargs=kwargs, dynamo=True)
XNNPACKQuantizer is deprecated in PyTorch and has moved to ExecuTorch; please use executorch.backends.xnnpack.quantizer.xnnpack_quantizer instead of torch.ao.quantization.quantizer.xnnpack_quantizer. (#144940)

XNNPACKQuantizer is a quantizer for xnnpack that was added into pytorch/pytorch for initial development. However, as it is not related to our core quantization workflow, we have moved it to ExecuTorch instead. Please use it from executorch.backends.xnnpack.quantizer.xnnpack_quantizer instead of torch.ao.quantization.quantizer.xnnpack_quantizer.

Version 2.6.0

from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config()
)
m = capture_pre_autograd_graph(m, *example_inputs)
m = prepare_pt2e(m, quantizer)

Version 2.7.0

# we also updated the export call
from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e

# please get the xnnpack quantizer from executorch (https://github.com/pytorch/executorch/)
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config()
)
m = export(m, *example_inputs)
m = prepare_pt2e(m, quantizer)
New features
Release Engineering
Python Frontend
  • Introduce a new torch.utils.serialization.config namespace for all serialization-related configurations (#143324); see the sketch after this list
  • Add torch.serialization.config.save.use_pinned_memory_for_d2h to speed up torch.save when passed gpu devices (#​143342)
  • Add torch.utils.serialization.config.load.calculate_storage_offsets to reduce random reads and significantly improve performance for storage with bad random access performance (#​143880)
  • Add support for __torch_function__ handler on dtype arguments, similar to subclass objects (#​145085)
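
A minimal sketch of toggling the configuration options listed above. The attribute names are taken from the notes, but the namespace is spelled both torch.serialization.config and torch.utils.serialization.config above, so treat the exact import path as an assumption:

import torch
from torch.utils.serialization import config as serialization_config

# speed up torch.save of GPU tensors via pinned staging buffers
serialization_config.save.use_pinned_memory_for_d2h = True
# reduce random reads when loading from storage with poor random access
serialization_config.load.calculate_storage_offsets = True

torch.save({"w": torch.randn(4, 4)}, "checkpoint.pt")
restored = torch.load("checkpoint.pt")
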
C++ Extensions
Distributed
Context Parallel
  • We provided a Context Parallel API (#131351) for users to parallelize torch.nn.functional.scaled_dot_product_attention over the sequence dimension. We implemented Ring Attention (#131351) and an AllGather-based approach (#132820), where the all-gather is issued before the first local SDPA and the subsequent local SDPAs must wait until the all-gather completes, and offered a user API (#142093) to select the desired approach. The implementation currently supports three SDPA kernels: SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, and SDPBackend.CUDNN_ATTENTION (#148537). We also verified that our Context Parallel implementation is compatible with other parallelisms and torch.compile.
c10d
Distributed Checkpoint (DCP)
  • Cache save plans to mitigate overhead from planning steps (#147116, #147343)
  • Build a storage reader/writer to write checkpoints in HF format (#​148089)
CUDA
  • Blackwell support added across native kernels, CUDA math libraries, and torch.compile (#​145270)
  • Make torch.cuda.gds APIs public (#​147120)
MPS
  • Prototype of torch.compile for Metal (#​143893)
  • Provide Metal kernel authoring via Python (#​148972)
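
A minimal sketch of the torch.compile-for-Metal prototype listed above; it assumes an Apple Silicon machine where the MPS backend is available:

import torch

@torch.compile
def fused(x):
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

x = torch.randn(1024, device="mps")
print(fused(x).mean())  # ~1.0
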
ROCm
XPU
torch.compile
Dynamo
  • Support tracing contextlib.contextmanager in Dynamo (#136033); see the sketch after this list
  • nonstrict_trace escape hatch to apply non-strict tracing to difficult-to-compile code (#​146367)
  • Delayed compile for dynamic shapes (#​147983)
  • Support tracing generators (#​141055)
  • Whitelist of source files to apply dynamic shapes to (#​147979)
  • Support tracing list subclasses (#​146819)
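
As referenced in the contextlib.contextmanager item above, a minimal sketch of the kind of pattern Dynamo can now trace; the helper names here are illustrative only:

import contextlib
import torch

@contextlib.contextmanager
def scaled(x, factor):
    # a user-defined context manager that Dynamo traces through
    yield x * factor

@torch.compile
def f(x):
    with scaled(x, 2.0) as y:
        return y + 1.0

print(f(torch.ones(3)))
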
Inductor
  • Enable non power-of-2 head_dim for FlexAttention (#​133495).
  • Add FlexAttention kernel parameter tuning options: num_warps and num_stages (#​139639).
  • Support vectorization for score and mask in FlexAttention CPU (#​143638).
  • ConfigFuzzer: a new debugging tool designed to fuzz Torch compile configurations. Given a test function, it will identify combinations of configs that throw errors during compilation and execution (#​139736) (#​145565).
  • Support fusion of pointwise ops into Template Prologues. TORCHINDUCTOR_PROLOGUE_FUSION enables this feature (#147008); see the sketch after this list.
  • Add instantiation level for generating configs in the CUTLASS backend. Set TORCHINDUCTOR_CUTLASS_INSTANTIATION_LEVEL. Consult config.py for information (#​146230).
  • Add L2 Swizzle config for CUTLASS backend: cuda.cutlass_max_profiling_swizzle_options (#​146088).
  • Emit a CMakeLists.txt when package_cpp_only is specified in AOTI (#​143352).
  • One Dynamo graph can now map to multiple inductor graphs with different graph_partition functions. Set the graph_partition in inductor config to enable (#​147038).
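
As referenced in the prologue-fusion item above, a minimal sketch of enabling the feature via its environment variable; the function body is just an illustrative fusion candidate, and the Triton template path requires a GPU:

import os
os.environ["TORCHINDUCTOR_PROLOGUE_FUSION"] = "1"  # must be set before compiling

import torch

@torch.compile
def f(a, b):
    # pointwise op feeding a matmul: a candidate for fusion into the template prologue
    return (a.relu() @ b).sigmoid()

out = f(torch.randn(64, 64), torch.randn(64, 64))
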
Profiler
  • Add overload names to profiler (#​143114)
  • Enable profiling on all threads via experimentalConfig (#​143659)
Quantization
  • Enable kernels from KleidiAI to run models whose weights were quantized to int4 (symmetric quantization, either channel-wise or group-wise with a group size that is a multiple of 32). At runtime, activations are dynamically quantized from fp32 to int8 and the weights are upcast from int4 to int8 so that an int8 matrix multiplication is executed. The dynamic activation quantization and the matrix multiplication are performed inside torch.ops.aten._dyn_quant_matmul_4bit, while the weights, scales and optional bias are packed by torch.ops.aten._dyn_quant_pack_4bit_weight. To use it on your model, you can quantize it using the following example that leverages torchao:
import torch

from torchao.dtypes import PlainLayout
from torchao.experimental.packed_linear_int8_dynamic_activation_intx_weight_layout import (
    PackedLinearInt8DynamicActivationIntxWeightLayout,
)
from torchao.experimental.quant_api import (
    int8_dynamic_activation_intx_weight,
)
from torchao.quantization.granularity import (
    PerGroup,
    PerRow,
)
from torchao.quantization.quant_api import quantize_
from torchao.quantization.quant_primitives import MappingType
my_model = Model()  # Model() is a placeholder for your own nn.Module
quantize_(
    my_model,
    int8_dynamic_activation_intx_weight(
        weight_dtype=torch.int4,
        granularity=PerGroup(32), # PerRow() is also supported
        has_weight_zeros=True, # Should be True
        weight_mapping_type=MappingType.SYMMETRIC_NO_CLIPPING_ERR,  # MappingType.SYMMETRIC can also be used but increases error
        layout=PackedLinearInt8DynamicActivationIntxWeightLayout(target="aten"),
    ),
)
ONNX
torch.onnx.verification.verify_onnx_program (#​148396, #​148706, #​148730, #​148707)

A new verification API torch.onnx.verification.verify_onnx_program can now be used to verify numerical accuracy of the exported ONNX model. Users can use the compare_intermediates option to identify any operator that causes numerical discrepancies in intermediate tensors. It is possible to use a tool like model-explorer to visualize the verification results.
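
A minimal sketch of that verification flow; the exact verify_onnx_program signature and return type are an assumption here and may differ slightly:

import torch

class M(torch.nn.Module):
    def forward(self, x):
        return torch.sin(x) + 1.0

onnx_program = torch.onnx.export(M(), (torch.randn(4),), dynamo=True)
# compare_intermediates=True reports per-operator discrepancies, not just final outputs
results = torch.onnx.verification.verify_onnx_program(
    onnx_program, compare_intermediates=True
)
print(results)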

  • Support custom axis name through dynamic_shapes (#​146321)
  • torch.onnx.export(dynamo=True) now optimizes the output model by default (#​146187)
Improvements
Release Engineering
Python Frontend
  • Add support for CPU scalar in torch.addcmul (#​143264)
  • Set -DPy_LIMITED_API flag for py_limited_api=True cpp_extensions (#​145764)
  • Add support for serialization for uintx/intx in weights_only (#​147500)
  • Add warning to torch.jit.load (#​143403)
  • Make record/storage alignment in torch.save configurable (#​147788)
  • Support with statement on torch.Stream (#​140138)
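
A minimal sketch of the "with statement on torch.Stream" item above, assuming a CUDA device (torch.Stream is the device-agnostic stream class):

import torch

s = torch.Stream(device="cuda")
with s:  # ops issued inside the block run on this stream
    y = torch.randn(1024, device="cuda") * 2
torch.cuda.synchronize()
print(y.sum())
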
Autograd
  • Allow torch.autograd.graph.GradientEdge as torch.autograd.backward outputs #​144744
  • Implement gradient for the residuals of torch.linalg.lstsq #​148526
  • Add deterministic kernel for reflection_pad2d_backward (#​136241)
  • Improve softmax backward pass native CUDA implementation (#​145866)
  • Improve Pareto frontier plot for AutoAC (#​148678)
Dataloader
  • Dataloader distributes tasks to workers as they become available when in_order is False (#142324); see the sketch after this list
  • Update pin-memory-related APIs to no longer pass a device argument; device and pin_memory_device are discouraged and will be deprecated in the future (#131858)
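
A minimal sketch of the in_order item above; the toy dataset is illustrative, and with in_order=False batch order may vary between runs in exchange for better worker utilization:

import torch
from torch.utils.data import DataLoader, Dataset

class Squares(Dataset):
    def __len__(self):
        return 8
    def __getitem__(self, i):
        return torch.tensor(i * i)

loader = DataLoader(Squares(), batch_size=2, num_workers=2, in_order=False)
for batch in loader:
    print(batch)
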
Linear Algebra
  • Improve dim argument validation for empty inputs for torch.cum{min,max}. (#​143920)
  • Properly throw an error when trying to sort complex numbers. (#​144113)
Nested Tensor (NJT)
  • Support NJT chunk() backward on batch dim (#​144584)
  • Support remaining *_like factory functions for NJT (#​144889)
  • Improve matmul with NJTs via backward support and composition with dense tensors (#​144587, #​146405)
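
A minimal sketch of an NJT (jagged layout) matmul composing with a dense tensor and backward, per the items above:

import torch

a = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)],
    layout=torch.jagged,
    requires_grad=True,
)
w = torch.randn(8, 4, requires_grad=True)

out = a.matmul(w)              # NJT @ dense
out.values().sum().backward()  # backward through the jagged values
print(w.grad.shape)
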
torch.nn
  • Add strict kwarg to nn.Module.set_submodule and fix bug for non dot-delineated strings (#​143455)
  • Improve input dimensions check for reflection_pad1d, reflection_pad2d and reflection_pad3d (#​141670)
torch.optim
Build Frontend
  • Make PyTorch build with Homebrew-installed OpenMP (#145870)
  • Enable onednn in pytorch for ppc64le architecture (#​143743)
  • Enable build for Blackwell GPU family (#​145436)
  • Fix OOM while building on Raspberry Pi by sharding code-generated files (#144364)
C++ Frontend
  • Introduce a new API isAcceleratorExcluded (#​144959)
Distributed
c10d
  • Simplified abort and shutdown by adding both to Backend and ProcessGroup objects (#​148798)
  • Used new_group instead of split_group on non-CUDA device (#​141469)
  • Removed call_guard in pybind object init of c10d (#​143598)
  • Enabled coalescing path on XPU and dispatch to XPU tensor barrier if XCCL backend is specified. (#​143735)
  • Preserved PyWork's Python reference counting when used in functional collectives (#​146376)
  • Enabled soft fail bind when agent store active inside TCPStore (#​147465)
  • Made getDefaultBackend more fault tolerant (#​148596)
DistributedDataParallel (DDP)
  • Added init_sync option to control collectives during initialization (#​142824)
  • Decoupled python reducer from compilation mode (#​147123)
FullyShardedDataParallel2 (FSDP2)
  • Clamp reduce_dtype in lazy init (#​143297)
  • Enabled FSDP2 on XPU device (#​143737)
  • Made post-backward condition more robust (#​144781)
  • Enabled MTIA device in FSDP2 library code (#​145842)
  • Avoided resetting version counter of all_gather_output in inference_mode (#​146709)
  • Supported ignoring parameters in FSDP2 (#​146631)
  • Enabled FSDP tests on XPU device (#​147518)
  • Enabled FSDP2 on HPU device (#​148667)
DTensor
  • Added aten.amin/amax to linear_reduction_strategy (#​143747)
  • Added src_data_rank to distribute_tensor API (#143883); see the sketch after this list
  • Added strategy for _scaled_mm (#​143760)
  • Added aten.view.dtype op support (#​144404)
  • Enabled sharding prop to handle cross mesh computation (#​147869)
  • Added CuDNN SDPA op support to DTensor (#​148537)
  • Optimized shard_dim_alltoall to use alltoall_single (#​148868)
  • Deprecated _shard_tensor to use src_data_rank=None (#​144171)
  • Added pointwise ops strategy for aten.minimum (#​145816)
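
As referenced in the src_data_rank item above, a minimal sketch; it assumes the script is launched with torchrun so a process group can be created:

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

dist.init_process_group("gloo")
mesh = init_device_mesh("cpu", (dist.get_world_size(),))

full = torch.arange(16, dtype=torch.float32).reshape(4, 4)
# src_data_rank picks which rank's data is scattered to the others;
# src_data_rank=None (see the _shard_tensor deprecation above) skips the
# communication and uses each rank's local tensor as-is
dt = distribute_tensor(full, mesh, [Shard(0)], src_data_rank=0)
print(dt.to_local().shape)
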
TensorParallel
  • Propagated src_data_rank kwarg in TP API (#​144005)
Torch Elastic
  • Added kill logic for current process when killing a worker (#​141060)
  • Made etcd_rendezvous publicly importable (#​145396)
  • Exposed the rendezvous keepalive arguments (#​145228)
Pipelining
  • Added generate_stage_to_rank_mapping utility (#​146193)
  • Removed stage_index_to_group_rank from schedule (#​146217)
CPU
General
  • Implement blend operation for float, double, int in VEC ATen backend for SVE (#​146479)
  • Upgrade submodule oneDNN to v3.7.1 (#​148293)
x86
CUDA
  • Refine CUDA Stream priority (#​143849)
  • Expose sharedMemPerMultiprocessor device property to python (#​143119)
  • Expose remaining sharedMem cudaDeviceProps to python (#​143226)
  • Add range check for embedding_bag on input index >= 0 of cuda device (#​140791)
  • Fix linter warnings (#​147386)
  • Change behavior of pinning memory so it does not init a cuda context if one is not already present (#​145752, #​149033)
  • Add cutlass kernel for rowwise scaled mm on SM 10.0 (blackwell) (#​148421)
  • Add get_stream_from_external API for CUDA backend (#​143799)
  • Update cuDNN-frontend submodule to 1.10.0, used by cuDNN convolution and SDPA integrations (#​145780)
MPS
ROCm
  • Fix TunableOp UTs: Rotating Buffer (#​143172)
  • Enable *_load_dwordx4 ISA for BFloat16 and Half. (#​141397)
  • Fix condition for small tensor tuning (#​144087)
XPU
  • Enable FP64 GEMM (#​140677)
  • Enable Sparse CSR support (#​144722)
  • Improve XPU Stream implementation (#141123, #141119, #142347)
  • Enable XPU for Inductor MM Triton Kernel Benchmark (#​148237)
  • Align XPU convolution_backward output layout between fake tensor and real output tensor (#​146880)
  • Improve error handling and reporting in CMake files (#​149353)
  • Refine torch.xpu.get_device_properties API error message (#​144379)
  • Enable nested_layer_norm support for XPU (#​148593)
  • Generalize is_big_gpu() check in Inductor (#​143491)
  • Allow XPU device in sparse compressed tensor factory functions (#​147306)
Profiler
  • Add optional flag to profiler to toggle external correlations (#​143314)
  • Add delimiter in memory visualizer to show where allocation address begins (#147461)
  • Add last entry to truncated values in Kineto args (#​148576)
  • Add profiler activity for HPU devices (#​148182)
  • Add HPU availabilities to profiler (#​149115)
torch.compile
Dynamo
  • Better tracing support for user-defined dict subclasses (#​143548)
  • Improved graph break messages for some common graph break sites (#​146525)
  • Improved tracing of exceptions (#​146492)
  • Remove a number of builtin and third-party modules from trace_rules.py skipfiles (#​145856)
  • Remove some specialized variables for specific third-party classes (e.g. transformers ModelOutput) (#​143567)
  • Compiled Autograd dropped annotation requirements for custom autograd functions (#​146229, #​146720)
AOTDispatcher
  • Fix a quadratic compile time edge case during training when you have long parallel chains of compute (#​145082)
  • Handle compiling mutations on tangents in custom autograd.Functions (#141131)
  • Handle compiling buffer input mutations of the form buffer.copy_(int) (#141161)
  • Fix handling of mutable custom operators in compile when used with torch.inference_mode (#​147925)
Dynamic Shapes
Decompositions, FakeTensor and meta tensors

Several operator decomps received improvements/bugfixes:

  • torch._refs.tensor (#143461)
  • torch._refs.mean (#147188)
  • linspace (#147997)
  • addmv (#143792)

New meta tensor implementations for a few pytorch operators:

  • nonzero (#144727)
  • silu, sigmoid, _softmax, embedding (#147862)

New fake tensor implementation for a few pytorch operators:

  • unique_consecutive (#145649)

Several general FakeTensor improvements:

  • force UntypedStorage.from_buffer(buf) to return meta storage under FakeTensorMode (#146642)
  • support meta_tensor.to(device='cpu') under fake_mode (#146729)
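
A minimal sketch of the meta_tensor.to(device='cpu') behavior listed above; FakeTensorMode lives in a private module, so the import path is an assumption that may shift between releases:

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    t = torch.empty(4, device="meta")
    cpu_t = t.to(device="cpu")  # now supported under fake mode
    print(type(cpu_t), cpu_t.device)
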
Inductor
  • Add profiling support for codegened CPU FlexAttention kernels (#​145894).
  • Other FlexAttention improvements: (#​147765) (#​147435) (#​147010) (#​146657) (#​145059) (#​144938) (#​143299) (#​142281) (#​147918) (#​148857).
  • Add Inductor support for non-power-of-2 cooperative RSPLIT (#​145689).
  • Remove runtime dependency on packaging (#​149125)
  • Add Cutlass support for runtime param choices, starting with swizzle (#​147223).
  • Make the Inductor cpp backend enable_floating_point_contract_flag take a string. Previously, the only options were "on" or "off". Now the value of INDUCTOR_CPP_ENABLE_FLOATING_POINT_CONTRACT_FLAG will be passed to -ffp-contract (#143450).
  • Add upcasting FP16/BF16 math reductions to FP32 in Triton (#​141052).
  • Support for more types of async_compile pools. Set variable TORCHINDUCTOR_WORKER_START to one of "subprocess", "fork", or "spawn" (#​144491).
  • Create a new benchmarker to replace Triton's do_bench (#​133058).
  • Inplace-padding support for cpp-wrapper (#​145325).
  • New environment variable for emulate_precision_casts: TORCHINDUCTOR_EMULATE_PRECISION_CASTS (#145948).
  • New environment variables to filter cutlass kernels: TORCHINDUCTOR_CUTLASS_ALLOWLIST and TORCHINDUCTOR_CUTLASS_DENYLIST (#​148161).
  • Add option to disable runtime scalar assertions: TORCHINDUCTOR_SCALAR_ASSERTS (#​146462).
  • Add new inductor configs to compiler bisector: layout_optimization and comprehensive_padding (#​148450).
  • Add an option to skip optimizing generated wrapper code. Set AOT_INDUCTOR_COMPILE_WRAPPER_WITH_O0=1 (#​144866).
  • Support dynamic shape constraints in Export (#​146044).
  • Handle MLIR scf.yield more accurately in user Triton code (#​147762).
  • Support Triton 3.3: add a global_scratch arg, fix cpp_wrapper (#​148051, #​149973).
  • Removed an unnecessarily strict runtime alignment assertion, allowing more flexible use cases of AOTI (#143236).
  • Support _int_mm in AOTI (#​144571).
  • Support AOTI + CUDAGraphs when calling from Python (#​148601).
  • New post grad pass to remove torch.ops.aten._assert_tensor_metadata.default for AOTI (#​145028).
  • Support basic TorchBind in aot_compile and aoti_compile_and_package (#​148506).
  • Add top level tlparse logging for AOTI (#​147760)
  • Added Inductor dashboard benchmarks (#​144427, #​145791, #​145654, #​145655, #​146449, #​145683, #​141371, #​143223)
  • Add AOTI shim for _weight_int4pack_mm_cpu_tensor (#​149031)
torch.fx
torch.export
serialization
  • Add float8 support in serialization schema (#​143343)
  • Allow pickle protocol overriding for serialization (#​142253)
  • Add serialization support for SymInt inputs in higher-order op subgraphs (#​142284)
  • Unify single-output and multi-output serialization schemas for higher-order op subgraphs (#​143227)
  • Add "+export" logging to de/serialization process (#​145283)
  • Sync model container types to serialization schema (#​145959)
  • Serialize pytree namedtuple field names in input spec (#​145956)
  • Replace builtins.getattr with serializable higher-order-op fo

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Enabled.

Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

renovate bot added the dependencies label on May 17, 2025
renovate[bot] (Contributor, Author) commented on May 17, 2025

⚠️ Artifact update problem

Renovate failed to update an artifact related to this branch. You probably do not want to merge this PR as-is.

♻ Renovate will retry this branch, including artifacts, only when one of the following happens:

  • any of the package files in this branch needs updating, or
  • the branch becomes conflicted, or
  • you click the rebase/retry checkbox if found above, or
  • you rename this PR's title to start with "rebase!" to trigger it manually

The artifact failure details are included below:

File name: pixi.lock
ExecError: Command failed: pixi lock --no-progress --color=never --quiet

thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/time/entry.rs:568:9:
A Tokio 1.x context was found, but it is being shutdown.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/time/entry.rs:568:9:
A Tokio 1.x context was found, but it is being shutdown.
Error:   × failed to solve the conda requirements of 'cuda' 'win-64'
  ╰─▶ Cannot solve the request because of: No candidates were found for
      pytorch >=2.7.0 cuda12*.
      


@lucascolley (Member) commented:
Blocked on conda/infrastructure#1159

Labels: dependencies (Pull requests that update a dependency file)